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THE MEAN-SQUARE ERROR OPTIMAL LINEAR DISCRIMINANT FUNCTION AND 
ITS APPLICATION TO INCOMPLETE DATA VECTORS 


1 . INTRODUCTION 

In many pattern recognition problems, it is desirable to map observed data 
vectors in n-dimensionil Euclidean space R^, n > 1, to in order that, hope- 
fully, either the efficiency of classification will be increased by classifying 
the transformed observations in R^ or new insights into the structure of the 
data will be gained by literally viewing the transformed observations in 
r\ a map used for such purposes should be simple in structure while having 
the property that transformed observations from the different populations 
under consideration are as well separated as possible. Consequently, such 
a map is often chosen to be linear^ and, in some sense, optimal from the 
point of view of separating the transformed observations of the populations 
at hand. 

In some applications, one may expect to encounter data vectors which are 
incomplete in the sense that the values of one or more of the components 
of these vectors are unknown or missing. In such circumstances, both the 
choice and the implementation of a linear map from R*^ to R^ require special 
consideration. Indeed, once a linear map from R^ to R^ has been chosen, it 
is necessary that incomplete data vectors be mapped to R^ in such a way that 
their images are statistically compatible with the linear images of complete 
data vectors. Furthermore, if n is large and if components of data vectors 
can be missing at random, then it is generally impossible to prepare in 
advance enough "compatible" maps to meet every missing-component eventuality. 
Consequently, it seems advantageous in such circumstances to choose a linear 
map from R*^ to R^ for which one can easily obtain a "compatible" linear map 
appropriate for each incomplete data vector as it is encountered. 


^Throughout this memorandum, the term "linear" is used in reference to affine 
maps (linear plus constant) as well as to maps which are truly linear. 
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When there are only two populations under consideration, the mean-square 
error (MSE) optimal linear discriminant function (LDF) is a linear mapping 
from to which provides, in a certain sense, optimal separation in R^ 
of the transformed observations from the two populations. It is particularly 
suitable for applications in which incomplete data vectors occur for two 
reasons. 

k 1 

The first reason is that, for every k, the MSE-optimal LDF from R to R 
imposes certain statistical properties on transformed observations in R^ . 

One consequence of these properties is that the MSE-optimal LDF appropriate 
for each subset of the components of a data vector is automatically compatible 
from the point of view of classification with the MSE-optimal LDF for a full 
data vector. Another consequence is that a map closely related to the MSE- 
optimal LDF, referred to in the following as the "derived map," can be 
defined for each subset of the components of a data vector and is guaranteed 
to be compatible with the derived map for a full data vector in a way 
appropriate for viewing the structure of transformed observations in R^ . 

The second reason is that, as each incomplete data vector is encountered, 
both the MSE-optimal LDF and the corresponding derived map appropriate for 
the known components of the vector can be determined relatively easily from 
a relatively small amount of stored information. 

In the following sections, the MSE approach to classifier design is reviewed 

and specialized to the two-class case and the linear discriminant function. 

(Much of the material offered in this connection is either standard or an 

adaptation of standard material to suit the present purposes (see ref. 1).) 

Observing the statistical properties which the MSE-optimal LDF imposes on 

transformed observations in r\ the derived maps are defined and their 

statistical properties are discussed. We conclude by describing three 

methods: a "straightforward" method, a method due to Kittler (ref. 2), and 

2 

a method proposed by Golub by which the MSE-optimal LDF and the corresponding 


2 

Private communication. 
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derived map appropriate for a subset of the components of a vector can be con- 
structed with relatively little computation and storage. The relative advan- 
tages of the three methods depend In general on the particular application at 
hand In a fairly complex way; this dependence Is discussed In some detail. 



2. THE MSE APPROACH TO CLASSIFIER DESIGN 


Suppose that x = (x^, •••, x^)^ is an observation in R*^ known to be on one 
of m statistical populations •**, If the cost of misclassifying an 
observation is taken to be 1 , then the Bayes optimal classification rule is 
the following: assign x to if and only if 

P(oJ,/x) = P(Uj/x) 

where P(u)^/x) denotes the posterior probability that x is an observation on 
0 ). . It is often difficult or impossible to evaluate the posterior proba- 
bilities P(uj^./x) and therefore one must frequently deal with approximations 
of these probabilities in implementing the Bayes rule. Such approximations 
are conmonly of the form 

P(w^/x) ^ al^>(x) ; 1 = 1, m (1) 

where 4>{x) = [(}>i(x), •••, (J)^(x)]^ is a vector whose components are conveniently 

chosen linearly independent functions of x, and for each 1, 

a^. = ( 3 ^^, 3^.^)^ is a parameter vector determined so that approximation 

(1) is optimal in some sense. The classification rule determined by a set 

of such so-called discriminant functions al$ is the following: assign x 

to 0 ). if and only if al$(x) = "’^^al$(x). 


In the MSE approach to classifier design, one attempts to determine parameter 
vectors a^ which minimize the MSE of approximation (1), given by 


J(A) 



- A^4>(x) 


^p(x)dx 


where 

P(x) = [P(£D^ |x), P(u)|^[x)]^ 

p(x) = the unconditional probability density function of x 
i I = the Euclidean norm, i.e., lul^ = u^u 
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In practice, J(A) can seldom be evaluated exactly. Typically, a labeled 
sample x * ^^k^k=l ••• N Independent observations on the mixture of 
***’ ‘^m 9 ^ven, and the objective function 


II 

J(A) = 2 «k - A^t(x^) 


k=l 


is minimized, where aj^ = ^“kl ’ ***’ “km^^ defined by 


“kj " 


1 if x^eo). ^ 


0 if X|^^o). 


; j = 1, 


, m 


For large samples, the minimizer of J should be approximately the same as 
the minimizer of J. Indeed, if one denotes by the subset of x consisting 
of observations on w. and by the number of observations in x^» then 


^ J(A) 


XkCX 


A^$(x, 


m 


Ni 


fTT E 4»(Xk) 


+ 1 




It follows from the strong law of large numbers (see ref. 3) that with 
probability 1, 

lim i J(A) = J Ia^$(x) ^p(x)dx - 2 f P(x) V<l>(x)p(x)dx + 1 


Since this expression differs from J(A) by a constant independent of A, it 
has the same minimizer as J. 

A necessary condition for A to be a minimizer of J is that all partial 
derivatives of J with respect to the entries of a vanish. This is equivalent 
to the condition that SA = B, where 

N 

S = ^ <l>(X|^)<I>{xj^)^ 
k=l 
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and 


N 

B - Y, ♦(*k)“k 
k«l 

If S is nonsingular, this condition is sufficient as well as necessary, and 
J has a unique minimizer, 

A = S'^B (2) 

Since the functions 0 ^, •••, are linearly independent, the matrix 

j $(x)$(x)^p{x)dx 

is nonsingular, and it follows that, with probability 1, S is nonsingular 
for sufficiently large N. In fact, if <))i, •••, are real -analytic as well 
as linearly independent, S is nonsingular with probability 1 whenever N > r. 
(See Appendix 2 of ref. 4.) Thus it is reasonable to assume in the following 
that S is nonsingular and that the unique minimizer of J is given by eq. (2). 
The discriminant functions aj$ determined by eq. (2) are referred to as the 
MSE-optimal discriminant functions. 
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3. THE MSE-OPTIMAL LOF IN THE TWO-CLASS CASE 


Suppose that there are only two statistical populations under consideration. 
The classification rule determined by discriminant functions a[$ and aj<l> 
can be phrased in the following way: assign x to w, if and only if 
a <I>(x) > 0, where a = a^ - ag. If a^4> and ag^ are the MSE-optimal discrim- 
inant functions determined by eq. (2) on the basis of some labeled sample, 
then a can be obtained by right-multiplying both sides of eq. (2) by 
(1 , -1 to yield 


where 


a = S’^b 


* f 

k=l 


(3) 


and 




+1 if Xj^eu)^ 

-1 if X|^eu)2 


For a so defined, a $ is referred to as the MSE-optimal discriminant function. 

The MSE-optimal LDF is the MSE-optimal discriminant function a^4» obtained by 
defining $(x) =(][)• (Vectors and matrices are expressed in partitioned 
forms whose meanings should be clear from the context.) In this section, an 
explicit expression is derived for the MSE-optimal LDF in terms of the obser- 
vations in a given labeled sample. 

One easily obtains 




Ni - N2 


Ni - N2 

b = 

E 

*k - "k 

= 

N.m. - N_m 



Xk^Xj 
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s = 


where 


N 

k=l 


N N 


Lvk 


Lk=l k*l 


Nm' 


T T 

Nm Sy^ + N^miin^ + N 2*1121112 




X, ; i * 1 . 2 




m 




k=l 


and 


5w = 2 Vl • Nl'"l'"l - '<*'"2 

k=l 


The matrix is called the "wi thin-class scatter matrix" and can be written 
as 


Sy = $1 + $2 


where 


H (*k - "’(H>'k ■ 


is the scatter matrix for 


Setting a = where a^ is a scalar and a'cR", one sees that eq. ( 3 ) 




is equivalent to 
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Nm' 


Nm Sy + + N2fn2'’’2 



•0 


■ "1 ■ "2 ■ 



m 



a* 

to m 


N^m^ — N2m2 

to « 


This equation yields 

®0 ' F ^^1 ■ ^2^ ■ 

Sj^a' * - N2ni2 - Nma^ - + N2m2m2)a' 

Substituting eq. (4) into eq. (5), one obtains 

T T T 

Sy a' * - N2ni2 - (N^ - N2)m + (Nmm - - N2m2in2)a' 


(4) 

(5) 


N^N2 


NiN2 


T., 


= 2 -|j— (m^ - 1H2) - (m^ - 1^2) (m^ - m2) a 

= [2 - (m^ - m 2 )^a'](m^ - m 2 ) 


( 6 ) 


One sees from eq. (6) that, except for an unimportant scale factor, a' is 
the Fisher linear discriminant S^^m^ - m 2 ). 

Writing a' = X5y^(m^ - m 2 ) and substituting this expression In eq. (6), one 
obtains 

X(m, - mj) = ^[2 - X||t», - - nij) 

2 T -I 

where the vector norm || [[ is defined by lluj] - u u. It follows 
from this equation that 
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or 


\ * 


2N^Ng 


(n + N^N2|lm, - m^irj 
2 


Thus 


a‘ = 


Substituvlng eq. (7) into eq. (4) 


[m^ - 1TI2I 


"j-r^W ^ (m^ - m2) 


N2 ■ "’2* 1 ) 

M- 

Using algebra, one obtains the simpler expression 


.-1 


(m^ - m2) 


®0 " 


+ 11^,11^1 


I 1 ^ 1 1- 1 1 2 

r~T 


(r^ ^ N 2 ^ '"’1 ”'"2^1 ) 

From eqs. (7) and (8), the MSE-optimal LDF is seen to be 
a^<I>(x) = •*• a'^x 

(n 7* !i"2ii')-j5* ii»iii') 

lifj" il"') ■ ”21 I 

(nil ••m2) Sy X 


Jc-1 


/-Lf -1- + 
(^1 ^2 


[m^ - m2 1 


) 


(7) 


(8) 


(9) 
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4. STATISTICAL PROPERTIES IMPOSED BY THE MSE-OPTIMAL 
LOF AHO THE DERIVED HAP 


Once a map from to R^ has been chosen, it Is necessary to map Incomplete 
data vectors to R^ in such a way that their Images are statistically 
compatible with the Images of cwnplete data vectors. The phrase "statistlcany 
compatible" should be Interpreted In a way appropriate for the Intended 
oojectlve of mapping data vectors in r'^ to r\ whether the objective Is 
effeclent classification or gaining new Insights into the structure of the 
data set. Certain statistical properties im;x}sed by the r>SE~cptima1 LI^ on 
transformed observations in R^ and the implications of these properties in 
determining "statistically compatible" f-m11ies of maps will now be discussed. 


Suppose that efficient classification is the objective of mapping data 
vectors in R^' to R^. Let L(x) * a^4»(x) denote the MSE-optimal LOF as 
determined by eq. (9) on the basis of a labeled sample of observations on 
two populations and U 2 * According to the classification rule associated 
with L, zero is the threshold for discriminating between transformed 
observations on and transformed observations on S'ihce this is true 
independent of n, it follcws that the MSE-optimal LDF appropriate for each 
subset of the components of a data vector has the same associated classi- 
fication regions in R^ as the MSE-optimal LDF for a full data vector. Thus 
the family of all MSE-optimal LDF's appropriate for subsets of components 
of data vectors may be regarded as being statistically compatible in a way 
appropriate for this objective. 


Now suppose that the primary objective of mapping data vectors in h” to R^ is 
viewing the data. It is easily verified that 


L(m^) = 


^ - it llm, - mjll^ 

^ ^ t I Imi - 


( 10 ) 
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and 


■ N “ I f"*! " ^ 


L(m,) = 


- 2 


1 


Ih “'^21 


(11) 


From eqs. (10) and (11), one sees that L(m^) and L(ni 2 ) are in (-1, 1) and 
lie symmetrically to the left and right, respectively, of 


1 


1 


No N 


+ fT+ llm, - m2ir 


Note that if N^/N 2 is large, then both L(m^) and L(m 2 ) are near +1. If this 
ratio is small, then both L(m^) and L(m 2 ) are near -1. If N.| = N 2 » then 
L(m^) = -L(m 2 ). 


As an interesting aside, explore the limiting behavior of L(m^) and L(m 2 ) 
for increasingly large samples. Suppose that N^ and N 2 grow large in such 
a way that 


N 


lim ^ = a, 
N-k« ' 


and 


1 im 

N-w 


- a. 


for some nonnegative and a..^ satisfying + a 2 = 1 . Denote by \i. and L. 

the mean vector and covariance matrix, respectively, for observations on o). 

n ^ 

in R , and set I = a-iS-j + a 2 l 2 - Using algebra, it follows from the strong 

law of large numbers that, with probability 1, 


lim L(m, ) = 

N->oo ' 


~ Up, - Pjll^ 
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11m L(m«) = 
N-ko ^ 


_L 

“2 

T 


a 


Ip, - Pzll 


* Up, - P 2 II 


2 

7 


where the norm |1 H is now defined by 


1 |u j 1^ = for ueR*^ 

It is evident from eqs. (10) and (11) that L maps the sample means in to 
points in R^ which depend on the sample means and variances in R*^. From L, 
however, one can easily obtain u map, called the "derived map" and denoted 
by L', for which this is not the case. Specifically, we define 


L'(x) = 


“ ^2 I I ~ 1 1 


1 


7 


h] - ^211 L 


- ll™, 11^ + 2(ni, - 


( 12 ) 


One verifies immediately that L'(m^) = +1 and L'(m 2 ) = -1 » independent of 
n and the sample statistics in R*^. It follows that for each subset of the 
components of a data vector, the derived map appropriate for that component 
subset maps the sample mean vectors of w-j and C 02 in that component subset 
to +1 and -1, respectively. Consequently, one may regard the family of 
derived maps associated with subsets of components of data vectors as being 
statistically compatible for the objective of viewing the data to the extent 
that sample means of o)^ and 0)2 in subsets of components of data vectors are 
compatibly mapped. 


If both viewing the data and classification are objectives of mapping data 
n 1 

vectors in R to R , one may easily associate a classification rule with L' 
identical to that associated with L by observing that L(x) > 0 if and only if 


L'(x) 


1 _1 


m. 


m. 
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5. CONSTRUCTION OF THE MSE-OPTIMAL LDF AND THE DERIVED 
MAP FOR INCOMPLETE DATA VECTORS 


To successfully employ the MSE-optimal LDF or the derived map in mapping 
incomplete data vectors to r\ it may be essential to construct appropriate 
maps for incomplete data vectors as efficiently and accurately as possible. 
Three methods are offered for constructing the MSE-optimal LDF or the derived 
map appropriate for a subset of the components of a full data vector: a 
"straightforward" method, Kittler's method, and Golub's method. These 
methods require relatively little computation and storage, and require no 
"retraining", i.e., no direct dealing with the original labeled sample; hence 
they are well-suited for applications in which it is desirable to construct 
the appropriate map for each incomplete data vector as it is encountered. 

The underlying algebraic problem which must be solved to obtain the MSE- 
optimal LDF or the derived map appropriate for an incomplete data vector is 
described in the following section. Three procedures for solving this 
problem are offered and three methods are derived for constructing the 
desired MSE-optimal LDF and the derived map. A discussion of the relative 
advantages of these methods in applications, focusing on the relative 
efficiency and accuracy of the methods, concludes the section. 

5.1 THE ALGEBRAIC PROBLEM 

Suppose that A is a positive-definite symmetric k x k matrix and that u and 

V are vectors in R satisfying Au = v. For given indices i,, ••*, i., 

^ k-£ ^ 

8, < k, denote by v the vector in R obtained by deleting components 

i-|j •••» from v, and denote by K the (k - 8,) x (k - 8,) matrix obtained by 
deleting rows and columns i^ i^ from A. Consider the following 
problem: Find u*eR*^"^ which satisfies Au* = v. This algebraic problem is 

the fundamental problem which must be solved in constructing the MSE- 
optimal LDF or the derived map appropriate for an incomplete data vector. 
Three procedures for solving this problem follow; each procedure assumes 
that some information associated with the equation Au = v is initially 
available. 
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In the first procedure. It is assumed that A and v are available. The 
procedure consists of simply forming A and v and solving Au* » v In a straight 
forward manner. In formulating this procedure. It Is specified that the 
equation Is to be solved by first obtaining the Cholesky decomposition of A 
and then solving the resulting triangular systems. This method Is not only 
stable but also faster than competing methods such as Gaussla.'. elimination. 

For a full discussion of this method and related methods, see references 5 
and 6. 

PROCEDURE 1: 

a. Form v and A by deleting the components of v and the rows and columns 
of A indexed by 1^; j = 1, •••, i. 

b. Obtain In the usual way the (k - a) x (k - Ji) upper-triangular 
Cholesky factor R* satisfying R*^R* = A. 

T 

c. Obtain u* by solving in order the triangular systems R* z* = v and 
R*u* = z*. 

In the second procedure, it is assumed that A“^ and u are available. The 
basis of this procedure Is the following observation (ref. 2); If S. = 1 
and 1^ Is denoted simply by 1, then 

= D - j rr^ (13) 

where 

d Is the ith diagonal element of A"^ 

D Is the (k - 1) X (k - 1) matrix obtained by deleting the ith row and 

_i 

column from A ‘ 

r Is the (k - 1 ) -dimensional vector obtained from the ith column of A”^ 
by deleting d from it 

It follows that u* is given by 


u* = 


u 



(14) 
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where u. is the Hh component of u and u Is obtained by deleting u. from u. 
For general il, u* can be obtained from u and A” by repeated applications of 
eqs. (13) and (14). For convenience in describing Procedure 2, assume that 
il > i2 > ••• > i^. 

PROCEDURE 2: 

a. Set u(0) = u and A"^(0) = A“^ . 

b. For j = 1, •••, do the following: 

1 . Set 

u^. (j - 1 ) 

u(j) = u(j - 1) - - j j j -j ) • »"( j - 1) 

where 

u-j (j -1) "Js the component of u(j - 1) indexed by ij 

u(j - l)eR is obtained by deleting u. (j - 1) from u(j - 1) 

-1 

d(j - 1) is the diagonal entry of A (j - 1) indexed by i. 

J 

k “ i 

r(j - 1 )eR is obtained by deleting d(j - 1) from the column 
of A'^(j - 1) indexed by i . 

J 

2. If j = £, stop; otherwise, set 

A"’(j) = 0(j - 1) - r(j - l)r(J - I)''' 

where 

d(j - 1) and r( j - 1 ) are defined above 

D(j - 1) is the matrix obtained by deleting the ^ .th row and 
-1 

column from A (j - 1 ) 

c. Set u* = u(il). 

In the third procedure, it is assumed that one has available the upper- 
triangular Cholesky factor R satisfying R^R = A and the vector z satisfying 
R^z = V. Of course, u is determined by the equation Ru = z, which is in 
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upper-triangular form. The third procedure is based on the fact that, from 
R and z, one can obtain in an relatively efficient and stable manner a 
(k - t)-d1mens1onal equation in upper- triangular form whose solution is u*. 
Indeed, if ^ denotes the k ^ (k - t) matrix obtained by deleting columns 
h* ***» U R^R = A, R^z * V, and R^Ru^ = v = R^z. Now one can 

find a k X k orthogonal matrix P for which R = P( q )* where R’^' is a 
(k - £) X (k - a) upper-triangular matrix. (An efficient and stable way to 
obtain P is by composing appropriate Householder transformations. See 

ref. 7 for details.) Then R^R = R*^R*, and by setting P^z where 

k-£ ^ 

z*eR , one obtains 


r*Tr*u* = ( )^P^z = R*^z' 


It follows that u* is given by the equation R*u* = z*, which is in upper- 
triangular form. 


PROCEDURE 3: 

a. Form R by deleting the columns of R indexed by i.; j = 1, i. 

b. Obtain a k x k orthogonal matrix P for which the factorization 

R = P( n ) holds, where R* is a (k - £) x (k - £) upper-triangular 

^ T I 7* \ k-£ 

matrix; set P z = (j-**)* where z*eR . 

c. Obtain u* by solving the upper-triangular system R*u* = z*. 


5.2 THE THREE METHODS 

To demonstrate the relation of the algebraic problem just described to the 
problem of obtaining the MSE-optimal LDF or the derived map appropriate for 
an incomplete data vector, recall that the MSE-optimal LDF L for vectors in 
R*^ is given by L = a^$, where $(x) = (M and a is given by eq. (3). Since 
this is true for general n, one sees that if xeR denotes the vector 
obtained by deleting components i^, •••, i^ from a vector xeR*^, then the 
MSE-optimal LDF appropriate for x, denoted by L, is given by 

L = a*^^ 
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where 


♦C) =(i) 


and a* satisfies 


In this expression, 


and 


For convenience, write 


and 



( 15 ) 



It is clear that b is obtained from b by deleting from b the components 

indexed by i . for j = 1 , ••• , )l; similarly, S is obtained from S by deleting 

%} 

from S the rows and columns of S indexed by i . for j = 1 , •• • , £. Thus the 

\J 
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problem of determining L is fundamentally that of obtaining the solution 
a* of eq. (15), which is an algebraic problem of the type just described. 

Similarly, taking eq. (12) as the definition of the derived map for general 
n, observe that the derived map for x, denoted by L', is given by 



where 


A, 




i = 1, 2 






aT a aT a aT 


Xk^X 


and the norm | j | | is now given by 

||G||2 - for Gor"-’‘ 

Clearly, m^ and m 2 are obtained from and m 2 , respectively, by deleting 
components i^, •••, i^^ from m^ and m 2 i similarly, S|^ is obtained from Sj^ by 
deleting rows and columns i^, •••, from Sj^. Since L' is easily specified 

A ^ 1 A A_ 1 A 

once the vectors S|^ and S^^ m 2 are known, one sees that the problem of 
determining L* is fundamentally that of solving the equation 

= m. ; 1 = 1,2 (17) 


Once the role played by the algebraic problem of section 5.1 in the 
determination of L and L' has been established, three methods of obtaining 
these maps are immediately suggested. These methods can be formulated as 
follows . 
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THE STRAIGHTFORWARD METHOD OF OBTAINING L: 

a. Recall b and S from storage. 

b. Determine a* » §’^b by using Procedure 1 to solve eq. (15), taking 
k*n + 1,v»b, A»S, and u* * a*. 

c. Obtain C » a*^S. 

THE STRAIGHTFORWARD METHOD FOR OBTAINING L': 

a. Recall m^, m 2 * and S^ from storage. 

b. For 1 = 1, 2, determine u]f » ^W^^l using Procedure 1 to solve eq. (17), 

taking k = n, v - m^ , A = Sj^ and u* * u!f. 

c. Using u| = 1 = 1,2, determine llm^||^, Mm 2 il^* 

and 1 jm^ - 

/V 

d. Obtain L', as given by eq. (16). 

THE KITTLER METHOD FOR OBTAINING L: 

a. Recall a » S”^b and S“^ from storage. 

b. Determlna a* = S'^b by using Procedure 2 to solve eq. (15), taking 
k * n + 1, u = a, A’^ * S"\ and u* = a*. 

c. Obtain C = a*^$. 

THE KITTLER METHOD FOR OBTAINING L‘: 

a. Recall and Sy^"^ from storage. 

b. For 1 =1,2, determine ut * S^^m^ by using Procedure 2 to solve eq. (17), 

taking k = n, u = , A"^ = S^\ and u* = u|. 

c. Using ut = S^^m. , 1 = 1,2, determine ljm^||^, HmgH^, 
and I |m^ - ni 2 l 1^* 

d. Obtain L', as given by eq. (16). 
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THE GOLUB METHOD FOR OBTAINING L: 

a. Recall from storage the upper-triangular Cholesky factor R of S and 
the solution 2 of R^z = b. 

b. Determine a* * S”^b by using Procedure 3 to solve eq. (15) » taking 
k = n + 1 and u* = a*. 

c. Obtain L = a*^$. 

THE GOLUB METHOD FOR OBTAINING L' : 

a. Recall from storage the upper-triangular Cholesky factor R of and 

the solutions z^ of R^z^ = m^ for 1 =1,2. 

b. For 1 = 1,2, determine u| » using Procedure 3 to solve eq. (17), 

taking k = n, z = z^ , and u* = u|. 

c. Using uT = ^ = 1» 2, determine | [m^ 1 1 , l|m 2 ll , (m^ - m 2 ), 

and I |m^ - m 2 l 1^- 

d. Obtain L', as given by eq. (16). 

5.3 RELATIVE ADVANTAGES OF THE THREE METHODS 

In discussing the relative advantages of the three methods formulated in 
section 5.2, the focus is on efficiency and accuracy. None of the three 

methods presents any particularly stringent requirements of storage or 

preparatory computation. 

The efficiency of a method is usually reflected In the number of arithmetic 
operations required to implement it. Since the relative numbers of 
arithmetic operations required by the three methods are detennined directly 
by the relative numbers of arithmetic operations required by the three 
procedures in section 5.1, consider the arithmetic operations necessary to 
implement these procedures. In particular, since the number of additions 
required by procedures of this type is approximately the same as the number 
of multiplications required, the following formulas are offered which specify 
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the numbers of multiplications required by Procedures 1. 2, and 3, respectively, 
for any k. (When these procedures are used In the methods of section 5.2, k 
Is either n or (n + 1).) 


+ I (k - ^ (k - 1) (18) 

- l )^k - (t ■ 3(« - 1)(k - t) 

+ I (i - 1)2 + (k - I) + ^ (I - 1) + 1 (19) 


I (t - l)(k - + I (k - 1)^ + f - t) + ^ (k - t) 

It is apparent from the highest-order terms in these formulas that if k Is 
very large and If t Is small relative to k, then the Kittler method requires 
the fewest multiplications of the three, followed by the Gk)lub method and the 
straightforward method in that order. In fact, 1n these circumstances, If 
t ■ 1, then the numbers of multiplications required by the three methods are 
0(k), O(k^), and O(k^), respectively. If t > 1 but Is still small relative to 
k, then both the Kittler method and the Golub method require 0(k - multi- 
plications, while the straightforward method requires 0(k - multiplications. 
If t = 2, then the Kittler method requires fewer multiplications than the 
Golub method by about a factor of 12; however, as I grows, this advantage 
quickly drops to a factor slightly greater than 3. 

If k is not too large or if the size of t Is significant relative to k, then 
the order of the numbers of multiplications required by the three methods 
changes. To Illustrate the variance of this order with different values of k 
and t, see table I, In which the order is Indicated on the left (''K,*' "G," 
and "S" represent "Kittler," "Golub," and "straightforward," respectively), 
and the values of k are listed across the top. The lower right number In each 
entry Is the value of I for which the order on the left first occurred with 



the value of k given above; the upper left number In each entry Is the fraction 
t/k. It should be noted that* except for fairly small k» the orders (K» G* 

S), (K, S, G), (S, K, G), and (S, G, K) appear In order as I ranges from 1 to 
(k - 1). Furthermore* If k Is large, then these orders appear at fairly 
well -sped fled values of £/k. Specifically* the order (K* S* G) replaces the 
order (K, G* S) when l Is about 10 percent of k; the order (S* K, G) replaces 
the order (K* S* 6) when l Is about 21 percent of k; and the order (S, G* K) 
replaces the order (S* K* G) when I Is about 58 percent of k. Figures 1 
through 5 indicate for five different values of k the relative sizes of the 
numbers of multiplications required by the three methods as I ranges from 1 
to {k - 1 ). 

In gaging the accuracy of a n^thod, one should consider not only the nun^r 
of arithmetic operations required by the method but also the stability of 
the method. Loosely speaking, a method Is said to be stable If small errors 
Introduced In the course of the computation do not compound themselves to an 
unreasonable degree as the computation proceeds; otherwise* the method Is 
said to be unstable. Somewhat more strictly speaking* stable methods 
generally exhibit error growth which Is linear In the nunfcer of arithmetic 
operations performed, while the error growth associated with unstable methods 
Is exponential In the mmiber of arithmetic operations performed. To quote 
reference 8* "linear growth Is normal* usually unavoidable, and not dangerous; 
exponential growth may, however* be disastrous and should be avoided at all 
costs." 

The instability of a method may become especially serious when there are bad 
features inherent in a particular problem to which the method is applied. 

Here, the basic problem under consideration is that posed In section 5.1, 
that of solving the linear equation Au* * v using information about the 
equation Au « v. A linear equation Involving a positive-definite symmetric 
matrix is said to be ill-conditioned if the condition mm*er, defined to be 
the ratio of the largest eigenvalue to the smallest^, is large. The practical 


^For a definition of the condition ntxnber for a general linear system, see 
reference 6. 
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Importance of 111 'Conditioning Is that small errors Incurred In computing the 
approximate solution of an Ill-conditioned linear equation may result In 
large errors In the approximate solution. Thus, procedures for solving an 
111-condltior.ed linear equation must be chosen with care In order that they 
yield approximate solutions which are meaningful. In particular, unstable 
procedures should be avoided In solving Ill-conditioned linear equations. 

The linear equations (15) and (17), vdilch must be solved by the procedures 
of section 5.1 to obtain L and L', are likely to be Ill-conditioned In many 
applications. (The same Is true of the "parent” equations Sa « b and 

A 

S^Ui > 1 » 1, 2.) Indeed, the matrices S and will be Ill-conditioned 

If the "Incomplete" labeled training vectors x^, k * 1, •••, N, "nearly" lie 

n*fi. ^ 

In soiT« proper subspace of R . This will occur, for example. If two or 

A 

more of the ccxnponents of the "Incomplete" vector randcxn variable X are 
highly correlated. Consequently, potentially unstable procedures sho*jld be 
used In the solution of eqs. (15) and (17) only If It Is ascertained that 
these equations (and, perhaps, their "parent" equations as well) are not 
too Ill-conditioned. 

Procedures 1 and 3 of section 5.1 are known to be stable. Indeed, Procedure 1 
Involves only Cholesky decomposition and the solution of two triangular 
syst«ns. Both Cholesky decomposition and the solution of triangular systems 
are stable, and the latter can usually be carried out with a high degree of 
accuracy (ref. 7). In Procedure 3, the formation of the initial Cholesky 
factor R of A Is stable, as Is the solution of the upper-triangular systw 
R*y* s 2 *. Also, It was observed earlier that the factorization R « ^(o*) 
can be obtained In an efficient and stable way by constructing P as a 
composition of Householder transformations (ref. 9). 

On the other hand. Procedure 2 appears to be potentially unstable. The basis 
of Procedure 2 Is the repeated application of the matrix-inverse-update 
formula uf step 2, which Is derived from the Sherman-Morrison formula (ref. 9), 
for updating the inverse of a matrix following a rank 1 change. The Sherman- 
Morrison formula is known to exhibit Instability in many applications. 
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Indeed, the potential for difficulties In Procedure 2 Is apparent In the 
formula of step 2: If the subtraction called for by the formula Is subject 
to numerical error, the resulting approximate Inverse can actually fall to 
be positive-definitel If only one component of a data vector Is missing 
(t • 1), then the Inverse Is not updated and the potential Instabilities do 
not materialize. 

The salient points of this discussion are sunmarlzed and conclusions are 
drawn in the following: 

a. The straightforward method Is always stable. It Is preferred over 
the other two methods for reasons of efficiency as well as accuracy 
w^en more than about 21 percent of the vector ccxnponents are missing. 

It is more efficient than the Golub method when more than about 

10 percent of the components are missing. If the number of components 
of complete vectors Is not large (no greater than 30 to 40} then the 
straightforward method Is competitive In efficiency with the Golub 
method for any nianber of missing components. 

b. The Kittler method Is more efficient than either the straightforward 
method or the Golub method when no more than about 21 percent of the 
vector components are missing. However, because of the potential 
Instabilities Inherent In this nwthod. It should be used only when 
speed of computation Is of overriding concern and eqs. (15) and (17) 
and their "parent” equations are well -conditioned. When only one 
component Is missing, the Kittler method can be safely used and results 
In considerable savings In computation time. 

c. The Golub method is always stable. As the number of. components of 
ccxnplete vectors becomes greater than 40, It offers Increasingly 
significant benefits In efficiency over the straightforward r?»thod 
when no more than about 10 percent of the vector components are 
missing. After more than a few components are missing, the advantage 
In efficiency of the Kittler method over the Golub method drops to a 
factor slightly greater than 3. The Golub method becomes more efficient 
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then the Kittler method after about 58 percent of the components are 
missing. Of course, the straightforward method Is by far the most 
efficient method at this point. 



TABLE I.- VARIANCE OF ORDER 
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Figure 5.- Comparison of multiplications when k = 1000. 
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